The present invention relates to command scheduling in computer networks
and to a network interface for use in the command scheduling method. Moreover,
the present invention is particularly, but not exclusively, suited for use in large-scale
parallel processing networks.

With the increased demand for scalable system-area networks for cluster
supercomputers, web-server farms, and network attached storage, the
interconnection network and its associated software libraries and hardware have
become critical components in achieving high performance in modern computer
systems. Key players in high-speed interconnects include Gigabit Ethernet (GigE)™,
GigaNet™, SCI™, Myrinet™ and GSN™. These interconnect solutions differ
from one another with respect to their architecture, programmability, scalability,
performance, and ease of integration into large-scale systems. One factor which is
critical to the performance of such interconnects is the management and in particular
the scheduling of commands across the network.

With all computers, multiple demands are made of both internal and
peripheral resources, and the scheduling of these multiple demands is a necessary
procedure. In this respect each task to be executed is assigned to a queue where it
is stored until the required resource becomes available, at which point the task is
removed from the queue to be processed. The same is true to a much greater
extent with computer networks, where a large number of individual tasks, each
requiring data to be communicated across the network, are processed every second.
How efficient the network is depends upon its latency and bandwidth. The lower the
latency of the network and the wider the bandwidth, the better the network
performance. Latency is a measure of the time period between the application of a
stimulus (a request for connection in the network) and the first indication of a
response from the network, whereas the bandwidth of the network is a measure of its
information carrying capacity. Most network communications are of inherently short
duration, of the order of 5 milliseconds or less, and the extent to which the duration of
such network communications can be minimised is a factor in minimising the latency
of the network as a whole.
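
The interplay of the two quantities can be illustrated with the usual first-order model, in which the time to move a message is the latency plus the message size divided by the bandwidth. The model and the figures used below are illustrative assumptions, not measurements of any network described here.

```python
def transfer_time(size_bytes: float, latency_s: float,
                  bandwidth_bytes_per_s: float) -> float:
    """First-order estimate: fixed start-up latency plus serialisation time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


# Hypothetical figures: a 64-byte control message on a link with
# 5 microseconds of latency and 1 Gbyte/s of bandwidth.
small = transfer_time(64, 5e-6, 1e9)

# Hypothetical bulk transfer of 1 Mbyte over the same link.
bulk = transfer_time(1_000_000, 5e-6, 1e9)

# The latency term dominates the short message; the bandwidth term
# dominates the bulk transfer.
print(f"small: {small * 1e6:.3f} us, bulk: {bulk * 1e6:.3f} us")
```

Under these assumed figures, shortening the latency is what benefits short communications, which is the case the scheduling of commands is concerned with.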

US 6,401,145 describes a system for improving the bandwidth of a network of
processing nodes. Network requests are queued in the main memory of the
processing node for asynchronous transmittal of data between the processing node
and its network interface. Two queue-sets are used, the first queue-set being
dedicated to input data and the second queue-set being dedicated to output data.
Queuing priorities both for the input and output queue-sets are also determined
according to the importance of the data to be processed or transferred, and a queue
description record is established. Data is then transferred to or received from the
network interface according to the queuing priority.

In US 6,141,701 a system for, and method of, off-loading message queuing
facilities ("MQF") from a mainframe computer to an intelligent input/output device are
described. The intelligent I/O device includes a storage controller that has a
processor and a memory. Stored in the storage controller memory is a
communication stack for receiving and transmitting information to and from the
mainframe computer. The storage controller receives I/O commands having
corresponding addresses and determines whether the I/O command is within a first
set of predetermined I/O commands. If so, the I/O command is mapped to a
message queue verb and queue to invoke the MQF. From this, the MQF may
cooperate with the communication stack in the storage controller memory to send
and receive information corresponding to the verb.

The present invention seeks to provide an improved method of scheduling
commands to be transmitted between the processing nodes of a network which is
capable of improving the latency and bandwidth of the network in comparison to
known computer networks. A representative environment for the present invention
includes but is not limited to a large-scale parallel processing network.

In accordance with a first aspect of the present invention there is provided a
computer network comprising: at least two processing nodes each having a
processor on which one or more user processes are executed and a respective
network interface; and a switching network which operatively connects the at least
two processing nodes together, each network interface including a command
processor and a memory, wherein the command processor of said network interface
is configured to allocate exclusively to a user process being executed on the
processor with which the network interface is associated one or more segments of
addressable memory in said network interface memory as a respective one or more
command queues.

In accordance with a second aspect of the present invention there is provided
a network interface comprising a command processor and a memory, wherein the
command processor of said network interface is configured to allocate exclusively to
a user process being executed on a processor with which the network interface is
associated, one or more segments of addressable memory in said network interface
memory as a respective one or more command queues.

In accordance with a third aspect of the present invention there is provided a
method of storing and running commands issued by a processor having associated
with it a network interface comprising a command processor and a network interface
memory, comprising the steps of: the network interface receiving a request for a
command queue from a user process being executed on the processor; in response
to the request, allocating exclusively to the user process a memory segment of the
network interface memory as a command queue; storing one or more commands
associated with the user process in said command queue; and running said
commands in said command queue without further intervention from said processor.

With the present invention, command queues are stored in and actioned from
the network interface memory without further intervention from the processor. This is
possible as each memory region allocated as a command queue is exclusively
assigned to a particular user process being executed by the processor and so the
commands issued by that user process to the network interface are stored in a
command queue specific to that user process. In this way the network interface is
capable of processing command data at rates approaching 1 Gbyte/s and
delivering latencies from the PCI bus to the network interface of less than 100 ns
whilst still maintaining the security of the individual user processes.

An embodiment of the present invention will now be described, by way of
example only, with reference to the accompanying drawings in which:

Figure 1 is a schematic diagram of a computer network; 

Figure 2 illustrates the functional units of a network interface of the computer
network in accordance with the present invention; and

Figure 3 illustrates the allocation of memory space in the network interface
SDRAM in accordance with the present invention and three command queue
pointers used in the command queues of the network interface.

Figure 1 illustrates a computer network 1 which includes a plurality of
separate processing nodes connected across a switching network 3. Each
processing node may comprise one or more processors 4 each having its own
memory 5 and a respective network interface 2 with which the one or more
processors 4 communicate across a data communications bus.

The computer network 1 described above is suitable for use in parallel
processing systems. Each of the individual processors 4 may be, for example, a
server processor such as a Compaq ES45. In a large parallel processing system, for
example, forty or more individual processors may be interconnected with each other
and with other peripherals such as, but not limited to, printers and scanners.

As illustrated in Figure 2, the network interface 2 has an input buffer 20 that
receives data from the network via paired virtual first-in-first-out (FIFO) channels 21.
In addition, the network interface 2 includes, but is not limited to, the following
functional units which will be described in greater detail below: a memory
management unit (MMU) 22, a cache 23, a memory 24 (preferably SDRAM), a thread
processor 25, a command processor 26, a short transaction engine (STEN) 27, a
DMA engine 28 and a scheduler 29. Both the STEN 27 and the DMA engine 28 are
in data communication with the network interface output 31 to the switching network
3. The command processor 26 accepts ordered write data from any source. This
includes burst PIO writes from the processor 4; local writes from the thread
processor 25; burst writes from a network interface event that has just fired; write
data directly from the network; and even data written directly from another command
queue. The command processor 26 is used to control the STEN processor 27, the
DMA processor 28, and the thread processor 25. It is also used to generate user
interrupts to the processor 4 in order, for example, to copy small amounts of data, to
write control words and to adjust network cookies. Each of the functional units of the
network interface 2 referred to above is preferably interconnected using many
separate 64 bit data buses 30. This use of separate paths increases concurrency
and reduces data transfer latency.

The network interface 2 provides data communications and control
synchronisation mechanisms that can be used directly from a client program. That is
to say, the individual client programs run on the respective processor 4 with which
the network interface 2 is connected are able to issue commands via the network
interface 2 directly to the network 1, as opposed to all such commands being
processed via the operating system of the processor 4. These mechanisms are
based upon the network interface's ability to transfer information directly between the
address spaces of groups of cooperating processes, across the network, whilst
maintaining hardware protection between the process groups.

Each client program process, herein referred to as a user process, is
assigned a context value that determines the physical addresses it is permitted to
access on the network interface (described in detail below). Furthermore, the
context value also identifies which remote processes may be communicated with
over the network and where the processes reside (i.e. at other processing nodes).
Through the use of pre-assigned address spaces the security of the network and the
protection between process groups is maintained by the network interface 2. In this
respect, it should be noted that the user processes do not have direct access to their
context values; it is the network interface 2 that manipulates the context values on
behalf of the user processes.
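
By way of illustration only, the protection provided by a context value can be modelled as a lookup from the context to the address ranges it is permitted to touch. The class `ContextProtection`, its method names and the example addresses are hypothetical and introduced purely for this sketch; the actual mechanism is implemented in the network interface hardware.

```python
class ContextProtection:
    """Maps each context value to the address ranges it may access."""

    def __init__(self):
        self._ranges = {}  # context value -> list of (start, end) pairs

    def grant(self, context, start, length):
        """Pre-assign an address range to a context (done by the interface)."""
        self._ranges.setdefault(context, []).append((start, start + length))

    def is_permitted(self, context, address):
        """Return True if the address lies in a range granted to the context."""
        return any(lo <= address < hi
                   for lo, hi in self._ranges.get(context, []))


protection = ContextProtection()
protection.grant(context=3, start=0x1000, length=0x400)

inside = protection.is_permitted(3, 0x1200)          # within the granted range
outside = protection.is_permitted(3, 0x2000)         # beyond the granted range
other_context = protection.is_permitted(4, 0x1200)   # different context
```

In this model the user process never handles the context value itself; it is supplied by whatever performs the check on the process's behalf, mirroring the arrangement described above.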

In the case of a program being run in parallel by more than one processing
node on the network 1, the individual processes that make up the program are
assigned to their respective processing nodes and each process is allocated a virtual
process identification number through which it can be addressed by the other
processes in the program. The routing details for the program are then determined
and a virtual process table is initialised for each context. A virtual process table is
maintained by the network interface 2 for each process and contains an entry for
each user process that makes up the parallel program, indexed by virtual
process identification number. The virtual process table includes context values to
be used for remote operations to be carried out at remote processing nodes which
are hosting the relevant virtual process and routing information needed to send a
message from the local processing node to the other remote processing nodes
hosting the same virtual process.
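
A virtual process table of this kind might be sketched as a list indexed by virtual process identification number, each entry holding a remote context value and routing information. The entry layout, the representation of a route as a tuple of switch ports, and all numeric values below are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualProcessEntry:
    remote_context: int  # context value to use at the remote processing node
    route: tuple         # hypothetical routing information (switch ports)


def build_table(placements):
    """Build a table whose index is the virtual process identification number.

    `placements` is a list of (remote_context, route) pairs, one per process.
    """
    return [VirtualProcessEntry(ctx, tuple(route)) for ctx, route in placements]


# Hypothetical parallel program made up of three user processes.
table = build_table([(0x10, [0, 3]), (0x11, [1, 2]), (0x12, [1, 0])])

# Addressing virtual process 1 yields its remote context and route.
entry = table[1]
```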

Each user process is assigned exclusive rights to one or more virtual memory
segments in the SDRAM 24 of the network interface and has its own set of one or
more command queues which are mapped by the network interface into the
pre-assigned virtual address space of the process using the relevant context. Thus, as
schematically illustrated in Figure 3, a first part of the addressable space of the
SDRAM 24 is allocated to storing command queue descriptors 24a and a second
part 24b of the SDRAM addressable space is allocated to storing the command
queues. With respect to the second part 24b of the SDRAM, separate contiguous
SDRAM address spaces are allocated to each command queue; three command
queues 32, 33, 34 are illustrated in Figure 3. The first command queue 32 is a single
command queue for a first user process which separately has a command queue
descriptor 32a mapped to a command port. The second and third command queues
33 and 34 are separate command queues for a second user process and have
respective command queue descriptors 33a and 34a.

The command queue for each user process provides the user process with a
set of virtual resources including a DMA engine, a STEN, a thread processor and
interrupt logic. Through the pre-assignment of virtual address space by the network
interface in the manner described above, the security of the individual programs
being processed by the processor 4 is maintained without the need to invoke a
system call. This ability to circumvent the operating system of the processor 4
enables the latency of the network interface's operations to be significantly reduced.

The command queues enable user processes executing on the processor 4 to
write packets directly to the network 1. For example, short packets of up to 31
transactions, with each transaction being up to 32 64-bit words long, can be sent
through the command queue mechanism. The packets are typically for control
purposes or very low latency transfers of small quantities of data rather than the
transfer of bulk data, which is transferred more efficiently using DMA. As mentioned
above, each command queue is represented by a 32 byte queue descriptor also held
in the SDRAM 24. 8 Kbytes of contiguous SDRAM is preferably reserved for the
queue descriptors. Entries in a command queue are commands represented by one
or more 64 bit values. The shortest commands may be represented by one 64 bit
word whereas the longest may be represented by a whole packet with many
transactions. The commands issued by a user process contain sufficient control
information for the command processor to carry out retries and conditional
processing on behalf of the user process. This means that the user process can
write a sequence of packets to the command queue without waiting for one to be
acknowledged before sending the next.
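
The figures quoted above imply two simple bounds, worked through in the short calculation below. It assumes that 8 Kbytes means 8192 bytes and that the packet bound counts transaction data only, ignoring any command or header words; the constant names are introduced for the example.

```python
DESCRIPTOR_BYTES = 32                 # size of one queue descriptor
DESCRIPTOR_REGION_BYTES = 8 * 1024    # SDRAM reserved for descriptors (assumed 8192)
MAX_TRANSACTIONS_PER_PACKET = 31
MAX_WORDS_PER_TRANSACTION = 32
WORD_BYTES = 8                        # one 64 bit word

# Number of queue descriptors that fit in the reserved region.
max_descriptors = DESCRIPTOR_REGION_BYTES // DESCRIPTOR_BYTES

# Upper bound on the transaction data carried by one short packet.
max_packet_data_bytes = (MAX_TRANSACTIONS_PER_PACKET
                         * MAX_WORDS_PER_TRANSACTION
                         * WORD_BYTES)

print(max_descriptors, max_packet_data_bytes)
```

On these assumptions the reserved region holds 256 descriptors and a maximum-length short packet carries 7936 bytes of transaction data, which is consistent with the mechanism being aimed at control traffic and small transfers rather than bulk data.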

From the perspective of the user process, the command queues are virtual
resources in the form of blocks of write-only memory. The user process makes a
system call to request a queue of a specified depth and, as the assignment of the
command queue by the network interface 2 arises from a system call, access to the
queue is protected. Once the command queue is allocated, the management of the
queue becomes the responsibility of the user process. Where the algorithms of a
user process have a natural limit to the maximum quantity of outstanding work that is
issued to the queue, flow control through the assigned command queue can be
controlled by the user process always ensuring that work previously queued is
completed before new work is issued. If the maximum amount of work cannot be
calculated, then the user process may insert a guarded write of a control word to the
memory space into the command stream at regular intervals. Whichever procedure
is adopted by the user process to avoid overfilling the command queue, if an
overflow occurs, an error bit in the command queue descriptor is set, the command
queue traps and the relevant user process is signalled.
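
The first flow-control approach, in which the user process ensures that queued work never exceeds the depth of its queue, could be sketched on the host side as a simple credit counter. The class `QueueCredit` and the choice of counting in 64 bit words are assumptions for the example; how completion is reported back to the user process is not modelled.

```python
class QueueCredit:
    """Host-side bookkeeping so that a process never overfills its queue."""

    def __init__(self, depth_words):
        self.depth_words = depth_words  # depth requested when the queue was allocated
        self.outstanding = 0            # words issued but not yet known complete

    def try_issue(self, size_words):
        """Reserve space for a command; refuse if it could overflow the queue."""
        if self.outstanding + size_words > self.depth_words:
            return False
        self.outstanding += size_words
        return True

    def complete(self, size_words):
        """Release space once previously queued work is known to have finished."""
        if size_words > self.outstanding:
            raise ValueError("completing more work than was issued")
        self.outstanding -= size_words


credit = QueueCredit(depth_words=8)
first = credit.try_issue(5)    # fits in the empty queue
second = credit.try_issue(4)   # would exceed the depth, so it is refused
credit.complete(5)             # the first command finishes
third = credit.try_issue(4)    # now there is room
```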

From a system perspective, an 8 Kbyte array of command ports is mapped
into the PCI address space. Each command port appears in the user address space
as an 8 Kbyte page and is mapped into one TLB entry of the main processor's MMU.
To allocate a queue to a user process, a queue descriptor is mapped to a command
port, a block of SDRAM of the requested size is reserved for the queue data and the
user process is given privilege to write to the command port.
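
The three allocation steps just described (bind a descriptor to a command port, reserve a block of SDRAM, grant write privilege to the requesting process) can be modelled as follows. This is a toy sketch: the class `QueueAllocator`, the simple bump allocation of SDRAM and the dictionary used as a descriptor are all assumptions, not the driver's actual data structures.

```python
class QueueAllocator:
    """Toy model of allocating command queues to user processes."""

    def __init__(self, sdram_bytes, num_ports):
        self.sdram_bytes = sdram_bytes
        self.next_free = 0                      # next unreserved SDRAM offset
        self.free_ports = list(range(num_ports))
        self.ports = {}                         # port -> descriptor (as a dict)

    def allocate(self, process_id, size_bytes):
        """Bind a descriptor to a port, reserve SDRAM, and record the owner."""
        if not self.free_ports:
            raise RuntimeError("no command port available")
        if self.next_free + size_bytes > self.sdram_bytes:
            raise MemoryError("not enough SDRAM for the requested queue")
        port = self.free_ports.pop(0)
        self.ports[port] = {"owner": process_id,
                            "base": self.next_free,
                            "size": size_bytes}
        self.next_free += size_bytes
        return port

    def may_write(self, process_id, port):
        """Only the process the queue was allocated to may write to its port."""
        return port in self.ports and self.ports[port]["owner"] == process_id


allocator = QueueAllocator(sdram_bytes=16 * 1024, num_ports=2)
port_a = allocator.allocate(process_id=7, size_bytes=8 * 1024)
port_b = allocator.allocate(process_id=9, size_bytes=8 * 1024)
```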

The network interface driver can directly access the queue descriptor and
queue data in the SDRAM 24, and when a user process writes a command to its
allocated command queue, the command is written directly to the SDRAM,
bypassing the cache 23.

Using the scheduler 29, the command processor 26 schedules the command
queues and preferably maintains a plurality of separate run queues, for example one
high priority run queue and one low priority run queue. Command queues that are
neither empty nor being executed by the command processor are added to one of
these run queues. The command processor preferably has a 'head of queue' cache
of 128 64 bit words and a 16 entry queue descriptor cache which is dedicated to the
queue pointers (described below). This allows separate processors on an SMP node
to write commands simultaneously to the network interface over a PCI bus without
significant queue rescheduling overhead.
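
A two-level arrangement of run queues might be sketched as below. The text does not state the policy used to choose between the run queues, so the sketch assumes strict priority with first-in-first-out order within each level; the class and method names are likewise hypothetical.

```python
from collections import deque


class RunQueueScheduler:
    """Two run queues of command-queue identifiers: high and low priority."""

    def __init__(self):
        self.high = deque()
        self.low = deque()

    def make_runnable(self, queue_id, high_priority):
        """Add a command queue that is non-empty and not currently executing."""
        (self.high if high_priority else self.low).append(queue_id)

    def next_queue(self):
        """Pick the next command queue to execute, or None if nothing is runnable."""
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None


scheduler = RunQueueScheduler()
scheduler.make_runnable("queue-a", high_priority=False)
scheduler.make_runnable("queue-b", high_priority=True)
scheduler.make_runnable("queue-c", high_priority=False)

order = [scheduler.next_queue() for _ in range(4)]
```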

Each command queue is managed by three pointers and each pointer is
manipulated by a separate process running in the command processor 26. The
pointers are illustrated in Figure 3 with respect to the command queue 32.

The insert pointer 40 points to the back of the command queue where new
entries are to be inserted. When it reaches the end of the memory space allocated
for that queue, it wraps around to point to the start of the memory space. The insert
pointer 40 is managed by an inserter process which receives command writes and
sends them to the command queue. The inserter process writes the commands to
incrementing addresses and after writing a command to the queue it updates the
insert pointer by the size of the command. The inserter process is only sensitive to
the order in which data is supplied to it: it does not use the write address to index
into the queue. The queue index is supplied solely in the queue descriptor by the
insert pointer.

The completed pointer 42 is the true front of the queue. It is only moved on
when a command sequence has completed. This means that the sequence can
be executed again should an error, trap or network discard take place. Many
separate commands may be required in a command sequence (for example, Open
STEN Packet, Send Transaction, Send Transaction, ... is the command sequence for
a packet for the STEN processor). When a command sequence has completed
successfully, the completed pointer is incremented by the size of that command
sequence. What constitutes the successful completion of a command is defined by
the command itself. Additional support can be provided specifically for generating
packets for the STEN processor 27.

The extract pointer 41 is a temporary value that is loaded from the completed 
pointer 42 every time a command queue is rescheduled. It points to the command 
value most recently removed from the queue by the command processor's extractor 
processes. The extract pointer 41 is incremented by one for each value taken from 
the queue. If a command fails, the extractor process is descheduled and the 
command queue is put back onto the run queue. When the queue is rescheduled,
the extract pointer is reloaded from the completed pointer. 
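
The cooperation of the three pointers can be illustrated with a small ring-buffer model. It is a simplified sketch: the queue is a Python list of words, a `used` counter is an assumed piece of bookkeeping for detecting overflow, and the extract pointer is modelled as the next word to read rather than the word most recently removed.

```python
class CommandQueueModel:
    """Ring buffer of words managed by insert, extract and completed pointers."""

    def __init__(self, size_words):
        self.words = [0] * size_words
        self.size = size_words
        self.insert = 0      # back of the queue: next free slot
        self.completed = 0   # true front: start of the oldest unfinished sequence
        self.extract = 0     # working read position while the queue is scheduled
        self.used = 0        # words between completed and insert (assumed bookkeeping)
        self.error = False   # set on overflow

    def write(self, command_words):
        """Inserter: append words in arrival order, wrapping at the end."""
        if self.used + len(command_words) > self.size:
            self.error = True  # insert would advance past completed
            return False
        for word in command_words:
            self.words[self.insert] = word
            self.insert = (self.insert + 1) % self.size
        self.used += len(command_words)
        return True

    def schedule(self):
        """On (re)scheduling, reading restarts from the completed pointer."""
        self.extract = self.completed

    def take(self):
        """Extractor: read one word and advance the extract pointer by one."""
        word = self.words[self.extract]
        self.extract = (self.extract + 1) % self.size
        return word

    def complete(self, sequence_words):
        """Completer: retire a finished sequence of the given length."""
        self.completed = (self.completed + sequence_words) % self.size
        self.used -= sequence_words
```

For example, if a three-word sequence is written and the extractor fails after taking two words, calling `schedule()` again reloads the extract pointer from the completed pointer and the same three words are replayed from the start; only `complete(3)` frees the space for new commands.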

As mentioned earlier, a command queue descriptor is generated and stored in
SDRAM which contains all the state required to manage the progress of the queue.
The fields of the command queue descriptor preferably include the following:

Error bit. This bit becomes set if the insert pointer advances past the
completed pointer, i.e. queue overflow. When this bit is set, it will cause a trap.

Priority bit. When this bit is set for a particular queue, the queue will run
with a higher priority than the queues without this bit set.

Size. This field denotes the size of the queue, which is preferably restricted to a
set of permissible predetermined sizes, for example: 1 Kbyte, 8 Kbytes, 64 Kbytes or
512 Kbytes.

Trapped bit. This will be set if a command being executed traps. The
processing node issuing the command then stops all execution of commands until
the trap state has been extracted. This means that when the processing node
issuing the commands is restarted, the command queue is dropped from the run
queue and this bit is then cleared.

Insert pointer. As mentioned above, this is the pointer to the back of the 
command queue where command data is to be inserted into the queue. 

Completed pointer. As mentioned earlier, this is the pointer to the front of
the command queue. It is only moved on when the operation is guaranteed to be
complete. It is not necessarily the pointer to the place the queue is being read from;
that is the Extract pointer as described below.

Restart count. This value is reduced every time the current pointer is reset
to the completed pointer. Each time this value is reduced, it will cause the queue to be
descheduled and another queue scheduled. When it reaches zero, it will also cause
the queue to trap.

Channel not completed bit. This is set when the last transaction of a packet
is executed. It is cleared when the completer process moves the completed pointer
over the packet. It is used to determine whether a packet is to be retransmitted.

Packet acknowledgement bits. This 4 bit acknowledgement provides the
queue packet status.

Context. As described earlier, these bits provide a context for all virtual
memory and virtual process references.

The extract process has additional state information that is created from the
queue descriptor when a new command queue is scheduled for execution. This
state is then discarded when the queue is descheduled. The additional state
information preferably includes:

Extract Pointer. As mentioned earlier, this pointer points to the current
command being executed. When a command queue is scheduled for draining, the
Extract pointer is loaded from the Completed pointer.

Prefetch Pointer. This pointer can be used to prefetch ahead new commands if
the queue data is being read from the SDRAM.
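
The split between persistent descriptor state and transient extract state can be sketched as two record types. Only a subset of the fields is modelled; field widths are not given in the text and are not represented, the example sizes are read as 1, 8, 64 and 512 Kbytes, and the default restart count of 3 is a purely hypothetical value.

```python
from dataclasses import dataclass

# Permitted queue sizes in bytes, following the example sizes listed above.
PERMITTED_SIZES = (1 * 1024, 8 * 1024, 64 * 1024, 512 * 1024)


@dataclass
class QueueDescriptor:
    """Persistent per-queue state held in SDRAM (subset of fields)."""
    size_bytes: int
    context: int
    insert_pointer: int = 0
    completed_pointer: int = 0
    restart_count: int = 3    # hypothetical initial value
    priority: bool = False
    error: bool = False
    trapped: bool = False

    def __post_init__(self):
        if self.size_bytes not in PERMITTED_SIZES:
            raise ValueError("queue size must be one of the permitted sizes")


@dataclass
class ExtractState:
    """Transient state that exists only while the queue is scheduled."""
    extract_pointer: int
    prefetch_pointer: int


def schedule(descriptor):
    """Create the extract state from the descriptor when a queue is scheduled."""
    return ExtractState(descriptor.completed_pointer,
                        descriptor.completed_pointer)


def restart(descriptor):
    """Reset after a failed command: count down, trap at zero, reload state."""
    descriptor.restart_count -= 1
    if descriptor.restart_count == 0:
        descriptor.trapped = True
    return schedule(descriptor)
```

Keeping the extract and prefetch pointers out of the descriptor reflects the point made above that this state is discarded on descheduling and rebuilt from the completed pointer.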

The command type is preferably encoded in the bottom bits of the first 64 bit
value inserted into the command queue, with the top bits being retained for command
data. The command types include but are not limited to Run Thread, Open STEN
Packet, Send Transaction, WriteDWord, Copy64bytes, Interrupt and Run DMA. Thus,
with the command type Run Thread, higher level message passing libraries can be
implemented without the explicit intervention of the processor 4. The thread
processor 25 can be used for single cycle load and store operations. It is closely
coupled to the cache 23 which it uses as a data store. Also, the command type
Open STEN enables short packets to be transmitted into the network 1 by means of
the STEN processor 27. The STEN processor 27 is particularly optimised for short
reads and writes and for protocol control. Preferably, the STEN processor 27 is
arranged to handle two outstanding packets for each command queue with the
packets it issues being pipelined to provide very low latencies. Similarly, the
command type Run DMA enables remote read/write memory operations via the DMA
engine 28.
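
Packing a command type into the bottom bits of a 64 bit word, with the remaining top bits carrying command data, might look as follows. The width of the type field and the numeric code assigned to each type are not given in the text; the 4 bit width and the codes below are assumptions chosen only to make the sketch concrete.

```python
from enum import IntEnum

TYPE_BITS = 4                        # assumed width of the type field
TYPE_MASK = (1 << TYPE_BITS) - 1
WORD_MASK = (1 << 64) - 1


class CommandType(IntEnum):
    # Numeric codes are hypothetical; only the names come from the text.
    RUN_THREAD = 0
    OPEN_STEN_PACKET = 1
    SEND_TRANSACTION = 2
    WRITE_DWORD = 3
    COPY_64_BYTES = 4
    INTERRUPT = 5
    RUN_DMA = 6


def encode(command_type, data):
    """Place the type in the bottom bits and the data in the top bits."""
    if data >> (64 - TYPE_BITS):
        raise ValueError("data does not fit in the upper bits")
    return ((data << TYPE_BITS) | int(command_type)) & WORD_MASK


def decode(word):
    """Recover (command type, data) from the first word of a command."""
    return CommandType(word & TYPE_MASK), word >> TYPE_BITS


word = encode(CommandType.RUN_DMA, 0xABC)
command_type, data = decode(word)
```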

As can be seen from the above, the network interface described above and in
particular the allocation of separate command queues for each user process greatly
improves the latency of the computer network as it enables the intervention of the
processor 4 to be avoided for individual operations. The present invention is
particularly suited to implementation in areas such as weather prediction, aerospace
design and gas and oil exploration where high performance computing technology is
required to solve the complex computations employed.

The present invention is not limited to the particular features of the network
interface described above or to the features of the computer network as described.
Elements of the network interface may be omitted or altered, and the scope of the
invention is to be understood from the appended claims. It is noted in passing that
an alternative application of the network interface is in large communications
switching systems.